Fix populate antijoin to use .proj() for correct pending key computation by hummuscience · Pull Request #1405 · datajoint/datajoint-python

hummuscience · 2026-02-19T16:14:19Z

Summary

Fix _populate_direct() to use self.proj() in the antijoin that computes pending keys
Fix jobs.refresh() to use self._target.proj() when computing new keys for the jobs table
Fix progress() fallback path to use self.proj() in the remaining count

Problem

The antijoin that computes pending keys (key_source - self) does not project the target table to its primary key before the subtraction. When the target table has secondary (non-PK) attributes, the antijoin fails to match on primary key alone and returns all keys instead of just the unpopulated ones.

This causes:

populate(reserve_jobs=False): all key_source entries are iterated instead of just pending ones. Mitigated by if key in self: check inside _populate1(), but wastes time on large tables.
populate(reserve_jobs=True): jobs.refresh() inserts all keys into the jobs table as 'pending', not just truly pending ones. Workers then waste their max_calls budget processing already-completed entries before reaching any real work — effectively making distributed populate non-functional for partially-populated tables.
progress(): reports incorrect remaining counts in the fallback (no common attributes) path.

Reproduction

# Given a Computed/Imported table with secondary attributes:
MyTable.populate(max_calls=5)  # partially populate

# This returns ALL keys, not just unpopulated ones:
pending = MyTable.key_source - MyTable()
print(len(pending))  # == len(key_source), should be len(key_source) - 5

# But this works correctly:
pending = MyTable.key_source - MyTable().proj()
print(len(pending))  # == len(key_source) - 5  ✓

Fix

Add .proj() to the target side of all three antijoins so the subtraction matches on primary key only:

Location	Before	After
`autopopulate.py:406`	`self._jobs_to_do(restrictions) - self`	`self._jobs_to_do(restrictions) - self.proj()`
`autopopulate.py:704`	`todo - self`	`todo - self.proj()`
`jobs.py:373`	`key_source - self._target`	`key_source - self._target.proj()`

Test plan

Added test_populate_antijoin_with_secondary_attrs — verifies pending key count after partial populate (direct mode)
Added test_populate_distributed_antijoin — verifies jobs.refresh() only creates entries for truly pending keys (distributed mode)
Existing test suite passes

🤖 Generated with Claude Code

The antijoin that computes pending keys (`key_source - self` in `_populate_direct`, `key_source - self._target` in `jobs.refresh`, and `todo - self` in `progress`) did not project the target table to its primary key before the subtraction. When the target table has secondary (non-PK) attributes, the antijoin fails to match on primary key alone and returns all keys instead of just the unpopulated ones. This caused: - `populate(reserve_jobs=False)`: all key_source entries were iterated instead of just pending ones (mitigated by `if key in self:` check inside `_populate1`, but wasted time on large tables) - `populate(reserve_jobs=True)`: `jobs.refresh()` inserted all keys into the jobs table as 'pending', not just truly pending ones. Workers then wasted their `max_calls` budget processing already-completed entries before reaching any real work. - `progress()`: reported incorrect remaining counts in some cases Fix: add `.proj()` to the target side of all three antijoins so the subtraction matches on primary key only, consistent with how DataJoint antijoins are meant to work. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

dimitri-yatsenko · 2026-02-19T19:48:51Z

Thanks for digging into this — the .proj() fix on the antijoin is a sound defensive pattern and the right thing to do regardless.

One question on the motivation: I traced through the code and with the current test fixture (Experiment with Subject as parent), key_source.proj() yields {subject_id} and self (Experiment) shares only subject_id as a common attribute name. The antijoin already matches on PK only in this case, with or without .proj(). Could you share a concrete example (table definitions + reproduce steps) where key_source - self actually returns wrong results today? I'd like to understand the exact conditions — e.g., a custom key_source that retains secondary attributes, or a naming collision between key_source attributes and target secondary attributes.

On the CI failures — two issues to fix:

Test logic: Experiment.make() inserts fake_experiments_per_subject = 5 rows per key, so populate(max_calls=2) processes 2 subjects and produces 10 rows, not 2. The assertions need to account for this: assert len(experiment) == 2 * experiment.fake_experiments_per_subject.
Lint: ruff-format wants the final assert in test_populate_antijoin_with_secondary_attrs collapsed to a single line. Running pre-commit run --all-files locally will fix it.

Happy to help get these sorted if you'd like — just let me know.

… attrs - Fix assertion counts: Experiment.make() inserts fake_experiments_per_subject rows per key, so populate(max_calls=2) produces 10 rows, not 2 - Add test_populate_antijoin_overlapping_attrs: self-contained test with Sensor/ProcessedSensor tables that share secondary attribute names (num_samples, quality), reproducing the exact conditions where the antijoin fails without .proj() - Run ruff-format to fix lint Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…en distributed test - make() only receives PK columns -- fetch source data from Sensor() instead - Use Schema.drop(prompt=False) instead of drop(force=True) - Use decimal types instead of float to avoid portability warnings - Remove test_populate_distributed_antijoin: Experiment non-FK experiment_id degrades job granularity, making the assertion unreliable Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

hummuscience · 2026-02-20T10:13:18Z

Thanks for the quick response. Excuse the oversight, it was the end of a long day :) I updated everything to reflect the bug.

The exact conditions that trigger the bug:

The antijoin key_source - self matches on ALL common column names. When key_source returns secondary attributes that share names with the target's secondary attributes, SQL matches on those too. If values
differ, no match is found and all keys appear "pending."

Here are the table definitions from our pipeline:


 class LightningPoseOutput(dj.Computed):
     definition = """
     -> LightningPoseTask
     ---
     lp_output_path : varchar(512)
     processing_datetime : datetime
     num_frames = null : int unsigned
     quality = 'good' : enum('good', 'review', 'exclude')
     ...
     """

 class InitialContainer(dj.Computed):
     definition = """
     -> LightningPoseOutput
     ---
     container_path : varchar(512)
     processing_datetime : datetime          # same name, different value
     num_frames : int unsigned               # same name, same value
     quality = 'good' : enum('good', 'review', 'exclude')  # same name, same value
     ...
     """

     @property
     def key_source(self):
         return LightningPoseOutput() & production_models

@Property
def key_source(self):
return LightningPoseOutput() & production_models

When populate() computes pending keys:

key_source = LightningPoseOutput() → includes processing_datetime, num_frames, quality
self = InitialContainer() → also has processing_datetime, num_frames, quality
key_source - self matches on ALL common columns including those 3 secondary attrs
Since LP's processing_datetime ≠ Container's processing_datetime, no match → all keys appear "pending"

Why we designed it this way:

In this case, InitialContainer simply converts the Lightning Pose CSV output into a https://github.com/neuroinformatics-unit/movement xarray container (NetCDF). The underlying data is the same — same video,
same frames, same quality assessment — so the metadata columns (num_frames, quality, processing_datetime) naturally share the same names. The only value that actually differs is processing_datetime, which
records when each step ran. Having the same column names across parent and child tables is a natural result of a pipeline where each stage wraps or transforms the previous stage's output, and populate() should
determine pending keys based on primary key identity regardless.

That said — if there's a recommended DataJoint pattern for this kind of multi-stage pipeline, let me know :)

dimitri-yatsenko · 2026-02-20T19:40:04Z

Thanks for the detailed explanation — the bug is clear now and it's a real issue.

Caveat: This only triggers when two conditions coincide: (1) a custom key_source that returns secondary attributes (the default key_source already calls .proj()), and (2) the child table has secondary attributes with the same names as the parent's. In your case, LightningPoseOutput and InitialContainer both have processing_datetime, num_frames, and quality, and the custom key_source returns the full LightningPoseOutput() with all those secondary attributes exposed.

User-side fix: You can also resolve this by projecting your custom key_source to primary key only:

@property
def key_source(self):
    return (LightningPoseOutput & production_models).proj()

This strips the secondary attributes before they reach the antijoin, matching the behavior of the default key_source. It's good practice for any custom key_source — return only the primary key attributes that define the work units.

On semantic matching: In DataJoint 2.0, attribute lineage tracking would have caught this. The processing_datetime in LightningPoseOutput (recording when LP ran) and processing_datetime in InitialContainer (recording when the container conversion ran) originate from different computational steps and would have different lineages. The antijoin's assert_join_compatibility check would raise an error on the lineage mismatch rather than silently matching on the wrong columns. This is exactly the class of bugs that semantic matching was designed to prevent.

The .proj() fix in this PR is still the right defensive measure — populate() shouldn't rely on users remembering to project their key_source. CI is green, the regression test is solid. Looks good to merge.

dimitri-yatsenko · 2026-02-20T21:14:51Z

Follow-up on semantic matching: I mentioned that DataJoint 2.0's semantic matching would have caught this, but it turns out it depends on whether lineage tracking is active in your schema.

Semantic matching relies on the ~lineage table, which stores attribute origins. If your schema was created before lineage tracking was added (or schema.rebuild_lineage() was never run), the ~lineage table won't exist. In that case, assert_join_compatibility silently degrades — it logs a warning and skips the check entirely (condition.py:256):

if not expr1.heading.lineage_available or not expr2.heading.lineage_available:
    logger.warning("Semantic check disabled: ~lineage table not found. ...")
    return  # skips the check

So if your pipeline's schemas don't have ~lineage tables, the semantic check was never enforced, and the overlapping processing_datetime attributes were silently matched by the antijoin without any error or warning about lineage conflicts.

To enable semantic matching on existing schemas: schema.rebuild_lineage(). After that, the antijoin key_source - self would raise a DataJointError on the lineage mismatch for processing_datetime rather than silently producing wrong results.

Muad Abd El Hay and others added 2 commits February 20, 2026 10:34

dimitri-yatsenko self-requested a review February 20, 2026 21:31

dimitri-yatsenko approved these changes Feb 20, 2026

View reviewed changes

dimitri-yatsenko merged commit dfc0e15 into datajoint:master Feb 20, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Fix populate antijoin to use .proj() for correct pending key computation#1405

Fix populate antijoin to use .proj() for correct pending key computation#1405
dimitri-yatsenko merged 3 commits intodatajoint:masterfrom
hummuscience:fix/populate-antijoin-proj

hummuscience commented Feb 19, 2026 •

edited by dimitri-yatsenko

Loading

Uh oh!

dimitri-yatsenko commented Feb 19, 2026

Uh oh!

hummuscience commented Feb 20, 2026 •

edited

Loading

Uh oh!

dimitri-yatsenko commented Feb 20, 2026

Uh oh!

dimitri-yatsenko commented Feb 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

hummuscience commented Feb 19, 2026 • edited by dimitri-yatsenko Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Reproduction

Fix

Test plan

Uh oh!

dimitri-yatsenko commented Feb 19, 2026

Uh oh!

hummuscience commented Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dimitri-yatsenko commented Feb 20, 2026

Uh oh!

dimitri-yatsenko commented Feb 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hummuscience commented Feb 19, 2026 •

edited by dimitri-yatsenko

Loading

hummuscience commented Feb 20, 2026 •

edited

Loading